In [17]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
plt.style.use('fivethirtyeight')
import math
import numpy as np
scores = pd.read_csv('sat_scores.csv')
scores.head(52)
Out[17]:
In [5]:
scores.Verbal.value_counts()
scores.Math.value_counts()
Out[5]:
The data lists the rate of participation and the mean Verbal and Math scores of students who took the SAT in 2001, broken down by state.
Initial Analysis:
From my initial observations, I assume that row 51 holds summary values: the 45 in the Rate column (with 506 for Verbal and 514 for Math) looks like the mean of the data in rows 0-50. Iowa and North Dakota, which are in the bottom 5 for rate of participation (47th and 48th, respectively), have some of the highest mean Math and Verbal scores, so a high rate of participation is not indicative of strong scores. I also want to ask the College Board whether they want to keep ranking the states by participation rate, which is the current ordering, or whether they would prefer ranking the states by highest mean Verbal and Math scores for their presentation this year.
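The row 51 assumption can be checked rather than eyeballed. Below is a minimal sketch of that check, using a made-up toy frame instead of sat_scores.csv; the helper name `last_row_matches_mean` and the tolerance `tol` are my own choices for illustration. Note that a summary row could also be a weighted national average, in which case it would not have to match the plain mean of the state rows.

```python
import pandas as pd

def last_row_matches_mean(df, cols, tol=1.0):
    """Return True if the last row is within `tol` of the mean of all
    earlier rows, for every column in `cols`."""
    body = df.iloc[:-1]                      # every row except the last
    last = df.iloc[-1][cols].astype(float)   # candidate summary row
    diffs = (body[cols].mean() - last).abs()
    return bool((diffs <= tol).all())

# Made-up toy frame (NOT the SAT data): three rows plus a summary row.
toy = pd.DataFrame({
    'State': ['AA', 'BB', 'CC', 'All'],
    'Rate': [10, 50, 90, 50],
    'Verbal': [580, 510, 500, 530],
    'Math': [590, 515, 500, 535],
})
print(last_row_matches_mean(toy, ['Rate', 'Verbal', 'Math']))
```

On the real data the same call would be made on `scores` with the columns Rate, Verbal and Math.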
In [159]:
scores.describe()
Out[159]:
In [9]:
#Sorting Verbal scores in ascending order
scores.sort_values('Verbal')
Out[9]:
Creating a list of State names extracted from the data.
In [65]:
list(scores.State)
Out[65]:
In [29]:
### Checking that types are accurate.
scores.dtypes
Out[29]:
In [32]:
# Build State -> value dictionaries for each column
drate = {}
dverbal = {}
dmath = {}
for x in scores.index:
    drate[scores.State.loc[x]] = scores.Rate.loc[x]
    dverbal[scores.State.loc[x]] = scores.Verbal.loc[x]
    dmath[scores.State.loc[x]] = scores.Math.loc[x]
print(drate)
print('')
print(dverbal)
print('')
print(dmath)
In [33]:
print('The minimum Rate is' + ' ' + str(min(scores.Rate)))
print('The maximum Rate is' + ' ' + str(max(scores.Rate)))
print('The minimum Verbal score is' + ' ' + str(min(scores.Verbal)))
print('The maximum Verbal score is' + ' ' + str(max(scores.Verbal)))
print('The minimum Math score is' + ' ' + str(min(scores.Math)))
print('The maximum Math score is' + ' ' + str(max(scores.Math)))
In [38]:
def std(col):
    # Sample standard deviation of a column (divides by n - 1)
    deviations = scores[col] - np.mean(scores[col])
    return math.sqrt(sum(deviations ** 2) / (len(scores) - 1))
print('Standard Deviation for Rate of Participation is ' + str(std('Rate')))
print('Standard Deviation for Average Verbal Score is ' + str(std('Verbal')))
print('Standard Deviation for Average Math Score is ' + str(std('Math')))
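As a sanity check on the hand-written formula, the sketch below compares a manual sample standard deviation (dividing by n - 1) against pandas' `Series.std()`, which also uses n - 1 by default, and NumPy's `np.std` with `ddof=1`. The values are made up for illustration and the helper name `sample_std` is my own.

```python
import math
import numpy as np
import pandas as pd

def sample_std(values):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    values = np.asarray(values, dtype=float)
    squared_dev = (values - values.mean()) ** 2
    return math.sqrt(squared_dev.sum() / (len(values) - 1))

# Made-up values (NOT the SAT data), only to compare the three methods.
toy = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
manual = sample_std(toy)
print(manual, toy.std(), np.std(toy.values, ddof=1))
```

All three numbers should agree up to floating-point rounding. Calling `np.std` without `ddof=1` would give the population version (dividing by n), which is slightly smaller.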
In [27]:
scores.Rate.plot(kind='hist', bins=5, title='Histogram of Rates of Participation')
Out[27]:
In [74]:
# Participation rates sorted in ascending order
scores.Rate.sort_values().values
Out[74]:
In [18]:
scores.Math.plot(kind='hist', bins=6, title='Mean SAT Math Scores for 2001')
plt.xlabel('Math Score')
plt.ylabel('Frequency')
Out[18]:
In [19]:
scores.Verbal.plot(kind='hist', bins=7, title='Mean Verbal SAT Scores for 2001')
plt.xlabel('Verbal Score')
plt.ylabel('Frequency')
Out[19]:
In [20]:
#Checking out the data with a density plot
scores.Verbal.plot(kind='density', xlim=(300, 700))
Out[20]:
A typical assumption is that the data are normally distributed.
That does not hold for the three histograms above (Rate, Verbal and Math). Rate is high at both ends of its range, Math peaks in the low 500s, and Verbal has peaks around 510 and 570.
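One rough way to back up the visual impression is to look at sample skewness and excess kurtosis, both of which are near 0 for normally distributed data. This is only a sketch on made-up, deliberately two-humped values, not a formal normality test, and the helper name `shape_summary` is my own.

```python
import pandas as pd

def shape_summary(series):
    """Return sample skewness and excess kurtosis of a Series.
    Both are close to 0 when the data are roughly normal."""
    return {
        'skew': float(series.skew()),
        'excess_kurtosis': float(series.kurt()),
    }

# Made-up values with a cluster at each end (NOT the SAT data).
toy_rate = pd.Series([4, 5, 6, 8, 9, 65, 70, 75, 80, 82])
summary = shape_summary(toy_rate)
print(summary)
```

A clearly negative excess kurtosis, as in this toy example, is what a distribution piled up at both ends tends to produce. On the real data the same helper could be applied to `scores.Rate`, `scores.Verbal` and `scores.Math`.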
In [21]:
scores.plot(kind='scatter', x='Verbal', y='Math', alpha=0.5)
Out[21]:
In [22]:
scores.plot(kind='scatter', x='Verbal', y='Rate', alpha=0.5)
Out[22]:
In [23]:
scores.plot(kind='scatter', x='Math', y='Rate', alpha=0.5)
Out[23]:
The Verbal and Math mean scores are positively correlated: as one increases, so does the other. In the scatterplot of Verbal against Rate, the highest scores belong to the states with the lowest rates of participation, and those points sit closer together than the points in the top left of the graph. As the rate of participation increases, the scores become lower and more spread out. The same holds, apart from a couple of outliers, in the scatterplot of Math against Rate. This implies that a higher rate of participation does not guarantee higher scores: the more students who take the SAT, the more likely it is that the mean score will decrease.
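The relationships read off the scatterplots can also be put into numbers with a Pearson correlation table. The sketch below uses a made-up toy frame with the same column names, so the printed values say nothing about the actual SAT data; the helper name `correlation_table` is my own.

```python
import pandas as pd

def correlation_table(df, cols):
    """Pairwise Pearson correlations between the given columns."""
    return df[cols].corr(method='pearson')

# Made-up toy frame (NOT the SAT data): scores fall as Rate rises.
toy = pd.DataFrame({
    'Rate': [5, 10, 30, 60, 80],
    'Verbal': [590, 575, 540, 510, 500],
    'Math': [600, 580, 545, 512, 505],
})
corr = correlation_table(toy, ['Rate', 'Verbal', 'Math'])
print(corr.round(2))
```

On the real data the call would be `correlation_table(scores, ['Rate', 'Verbal', 'Math'])`. A positive Verbal/Math entry and negative Rate/Verbal and Rate/Math entries would match what the scatterplots suggest.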
In [24]:
scores.Rate.plot(kind='box')
Out[24]:
In [25]:
scores.Verbal.plot(kind='box')
Out[25]:
In [26]:
scores.Math.plot(kind='box')
Out[26]:
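The points a box plot draws beyond its whiskers can also be listed directly. By default matplotlib places the whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, so a minimal sketch of the same rule, on made-up values and with a helper name of my own (`iqr_outliers`), looks like this:

```python
import pandas as pd

def iqr_outliers(series, whis=1.5):
    """Return the values lying more than `whis` * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return series[(series < lower) | (series > upper)]

# Made-up values (NOT the SAT data) with one value far above the rest.
toy = pd.Series([500, 505, 510, 512, 515, 520, 525, 600])
print(iqr_outliers(toy).tolist())
```

Quartile conventions differ slightly between tools, so borderline points may not match a plot exactly. On the real data the helper could be applied to `scores.Rate`, `scores.Verbal` and `scores.Math`.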